A memetic algorithm for finding multiple subgraphs that optimally cover an input network

Finding dense subgraphs is a central problem in graph mining, with a variety of real-world application domains including biological analysis, financial market evaluation, and sociological surveys. While a series of studies have been devoted to finding subgraphs with maximum density, the problem of finding multiple subgraphs that best cover an input network has not been systematically explored. The present study discusses a variant of the densest subgraph problem and presents a mathematical model for optimizing the total coverage of an input network by extracting multiple subgraphs. A memetic algorithm that maximizes coverage is proposed and shown to be both effective and efficient. The method is applied to real-world networks. The empirical meaning of the optimal sampling method is discussed.


Introduction
Over the past several decades there has been substantial interest in studying social networks beyond the traditional social sciences while maintaining a focus on social structures. Specifically, instead of focusing on demographic attributes of a certain population, an increasing number of studies have focused on the structure of relationships that connect individual behaviors with collective dynamics [1]. One focus of the analysis of network structure has concerned cohesive subgraphs [2]. Notable examples of this work are sociometric cliques [3] and variants such as n-cliques, n-clan, k-plex, or k-core [4]. Related work has focused on detecting core/periphery structures [5], rich clubs [6] or communities [7]. Generally, the aim of these studies has been to find one or more subgraphs that maximize some notion of density.
One popular notion of density that has been widely explored in the literature is the average degree (measured by the edge-to-vertex ratio), and the problem of finding a subgraph that maximizes the average degree is called the densest subgraph problem (DSP) [8]. Analysis of the DSP has been applied to DNA analysis [9,10], financial market evaluation [11], social surveys [12,13], and theoretical computer science [14,15]. In the Web domain, Gibson et al. identified link spam, one of the greatest challenges in evaluating search engine rankings [17], by extracting dense subgraphs in large graphs [16]. In the social context, DSP was applied to expert team formation [15,18] as well as party organization [19,20]. Angel et al. detected real-time stories by searching for dense subgraphs in the entity co-occurrence graph constructed from micro-blogging streams [21]. DSP has also been employed to find teams with higher collaborative compatibility [22].
DSP aims at extracting a single subgraph, but many real-world cases seek a collection of dense subgraphs, such as communities or social stories [23]. There are relatively few studies in this direction, one of which, by Balalau et al., focused on finding a set of m subgraphs that maximizes the sum of the subgraph densities (denoted as the "multiple-m densest subgraphs problem", MmDSP) [23]. Variants of this model have been proposed subsequently [24,25]. These studies have solved the problem of how to extract multiple dense subgraphs, but the process of covering the input network by extracting multiple subgraphs has not been addressed. Maximizing the subgraph density and maximizing the covering have different social meanings. In many real-world cases, the density of subgraphs does not have to be large. For example, a collection of network surveys may not focus on how dense each investigated network is, but on how best to cover the whole population, which is a boundary specification problem [26,27].
In a network survey, self-report of social relationships is commonly used to collect network data. Specifically, given a list of participants, the data are obtained from answers to single-item questions that ask participants to enumerate individuals to whom they are connected by a direct relationship of a specified kind [1,28]. The main purpose of such a network survey is to best cover the interactional relationships. Besides network surveys, the covering problem can be also applied to influence maximization [29], network tomography [30], or pinning control [31].
The present study addresses the problem of how to find multiple subgraphs that best cover the input network. We call the problem the "multiple-m covering k-subgraphs problem" (MmCkSP), i.e., maximizing the covering of the network edges given m subgraphs of limited size k. Unlike the classic graph partitioning problem and the densest subgraph problem, the present study aims to find how to extract multiple subgraphs so as to achieve the best coverage of the network relationships. Two illustrations that show the difference between MmCkSP and the densest subgraph problem are given in Fig 1. Given the input network in Fig 1A, standard community detection finds the partitions {1,2,3,4,5} and {6,7,8,9,10}. If we set the number of subgraphs to 3 and the subgraph size to 5, MmDSP may present the best solution as the subgraphs {1,2,3,4,5}, {6,7,8,9,10} and {3,6,8,9,10} in order to maximize the density of each extracted subgraph, while MmCkSP may extract the subgraphs {1,2,3,4,5}, {6,7,8,9,10} and {1,3,5,7,8}, which may cover all the network edges even though the subgraph {1,3,5,7,8} is not dense. MmDSP and MmCkSP also extract different subgraphs in Fig 1B. Here the edges a and b in Fig 1A and c in Fig 1B are ignored by MmDSP, but these omitted edges connecting different communities may sometimes have a useful social interpretation. Covering these edges can help provide a better understanding of the input network structure.
In real-world cases, the subgraph size and the number of subgraphs should be constrained because they are always associated with costs. Taking network surveys as an example, a larger nominal list of nodes makes the burden on respondents greater, in which case ties are more likely to be missed because respondents may not be able to recall enough to fully capture the network structure [32]. Here, we formalize MmCkSP as a new optimization problem, which goes beyond the conventional strategy of optimizing network density. An illustration of the optimization is shown in Fig 2. Given an input network consisting of six nodes and nine edges, if we constrain the subgraph size to 4, solution 1 extracts the three subgraphs {1, 2, 3, 4}, {1, 4, 5, 6} and {2, 3, 5, 6}, which cover all ties in the entire population. Solution 2 extracts only two subgraphs, {1, 2, 3, 6} and {3, 4, 5, 6}, which also include all the edges. Solution 2 in Fig 2 is therefore more cost-effective than solution 1, because it reaches full coverage with fewer subgraphs. Here, we design an algorithm that can find the most cost-effective solution.
The present study is organized as follows: related background, including the densest subgraph problem, and corresponding strategies for the problem with multiple subgraphs, including optimization models, are presented in section 2. In section 3, we propose a memetic algorithm that optimizes the covering problem for each subgraph. Experiments with the proposed algorithm on computer-generated and real-world networks are described in section 4. Section 5 presents the conclusion and discussion.

The densest subgraph problem and the solution approach
The densest subgraph problem (DSP) asks for the set of nodes that induces the subgraph with the highest density. Given a graph G(V, E), where $v_i \in V$ denotes the nodes and $e_{ij} \in E$ denotes the relationships, DSP aims to find a subgraph G'(V', E') whose average density, computed as $|E'|/|V'|$, is the largest [8]. The optimization of DSP can then be formulated as (1), below. The DSP has been shown to be solvable in polynomial time [8,33-35].
The average density of the extracted subgraph in DSP is associated with the subgraph size, and there is a tradeoff between the density and size [36]. From DSP, one may extract smaller subgraphs in sparser networks but extract larger subgraphs in denser networks. However, in real applications there always exists an upper bound for the subgraph, and one may constrain the size of dense subgraphs [36,37]. If all subgraphs have the same (bounded) size, the problem, which then becomes NP-hard [14,33], has been investigated under various names including the "k-cluster problem" [38][39][40], the "k-cardinality subgraph problem" [41], or the "densest k-subgraph problem" (DkSP) [42,43]. This problem is formulated as (2), below.
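Problems (1) and (2) referred to in the text can be written out as follows. This is a reconstruction from the definitions given above (density $|E'|/|V'|$, induced edge set $E'$, size bound $k$), not a copy of the published equations; in particular, writing the size constraint of (2) as an equality is an assumption.

```latex
\begin{align}
\max_{V' \subseteq V} \quad & \frac{|E'|}{|V'|},
  \qquad E' = \{\, e_{ij} \in E : v_i \in V',\ v_j \in V' \,\} \tag{1} \\[4pt]
\max_{V' \subseteq V} \quad & \frac{|E'|}{|V'|}
  \qquad \text{subject to} \quad |V'| = k \tag{2}
\end{align}
```

With $|V'|$ fixed at $k$, maximizing the density in (2) is the same as maximizing the number of induced edges $|E'|$.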

PLOS ONE
Some variants of DkSP have been proposed. If the extracted subgraph is required to be connected, the problem is referred to as the densest connected k-subgraph problem (DCkSP) [44]. In weighted networks, finding the subgraph with k nodes that has the highest sum of edge weights is called the "heaviest k-subgraph problem" (HkSP) [45].
DkSP also has important interpretations in social science. A social problem related to DkSP is called the "boundary specification problem" (BSP), which aims to find a list of samples that best represents the population [26]. When nodes are excluded from the system, the observed network structure differs from the actual one. Simulations have examined features of missing actors and have shown the detrimental impact of incomplete sampling [27,28,46,47]. The similarity between the sample and the complete network declines as more nodes are excluded, and missing nodes substantially affect measures related to the complete network [28,46,47].
Solving DkSP can help to solve BSP, as illustrated in Fig 3. The input network consists of eight nodes, and we set the subgraph size k at 7. If we exclude node 5, then the network has a ring structure, which is quite different from the original structure. If node 6 is omitted by accident or for convenience, the whole network becomes disconnected. This example illustrates how a minor change in network structure can have a dramatic effect on inference about network properties as a whole [48]. Only in special cases can the sampled network have a structure similar to that of the complete network [49]; the solution that excludes node 8 is such a special case and is also the best solution of DkSP. If we exclude node 8, most of the edges are preserved because of the principle of largest density.
To solve DkSP, a number of studies have focused on the use of semidefinite programming; that is, the problem is transformed into a semidefinite programming problem for each node of a branch-and-bound tree [39,40]. Some semidefinite programming relaxations have been also used to approximate DkSP [50,51]. Other studies wrote DkSP as a problem of rank-constrained cardinality minimization, and relaxed it by the use of the nuclear norm [52,53]. Also, a series of heuristic algorithms have been employed in solving the problem. Kincaid proposed a simulated annealing algorithm and a tabu search algorithm to solve the NP-hard DkSP [54]. Macambira employed a tabu search algorithm which was shown to outperform greedy search [55]. A variable neighborhood search heuristic proposed by Brimberg et al. was shown to be effective in solving the DkSP [56].
From a sociological view, given that inappropriate boundary specification can have a detrimental effect on estimating the structure of a real population, a list of sampling methods related to the sampling in network surveys has been also proposed. For example, randomly selecting individuals is a common method of sampling in social science investigations [27,45,57]. Top-down sampling (choosing the top nodes ordered by size) has also been widely used and yields estimates of network properties that are highly consistent with those obtained from whole network analysis [58,59].

Covering problem with multiple graphs
Finding multiple densest subgraphs has recently been discussed [23-25,60]. Balalau et al. focused on finding a set of m subgraphs that maximizes the sum of the subgraph densities under an upper bound on the pairwise Jaccard coefficient between the node sets of the subgraphs (denoted as the "multiple-m densest subgraphs problem", MmDSP) [23]. Nasir et al. proposed a dynamic variant of this problem, where a collection of m disjoint subgraphs is found in a sliding window [25]. An approach similar to MmDSP was proposed by Galbrun et al., where the objective function takes both the total density and the distance between the subgraphs into account [24]. Dondi et al. addressed the approximability and computational complexity of this problem [60]. An application of MmDSP to dual networks has also been studied [61]. In this paper, we study the multiple-m densest subgraphs problem (MmDSP) proposed by Balalau et al. [23]. MmDSP aims to find a collection of m subgraphs for which the sum of the average densities $|E_i|/|V_i|$ of the subgraphs is maximized [23]. Optimization of MmDSP can be formulated as problem (3) below, where a is the upper bound on the pairwise Jaccard coefficient.

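Maximize $\sum_{i=1}^{m} \frac{|E_i|}{|V_i|}$ subject to $\frac{|V_i \cap V_j|}{|V_i \cup V_j|} \le a$ for all $i \ne j$. (3)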
MmDSP has focused on improving the density of each subgraph but has ignored the covering of the input network by extracting subgraphs. Although techniques such as the pairwise Jaccard coefficient or the distance between subgraphs have been invoked to avoid too much overlap between the extracted subgraphs [23,24], the literature still lacks a focus on the network covering problem. The present study aims to find an optimal method for finding multiple subgraphs that best cover the input network, denoted as MmCkSP. There are three key elements associated with the sampling process: the covering of the input network (C), the bound on the subgraph size (k), and the number of subgraphs (m). Given the size of each subgraph |V_i|, practitioners need to assemble the collected ties into a network that can best cover the input network (C). The objective function of MmCkSP is then formulated as (4), below. When the number of subgraphs is limited to 1, problem (4) reduces to problem (2).
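Here we use the fraction of extracted edges to measure C, i.e., the number of edges collected by at least one of the m subgraphs divided by the total number of edges in the input network, written as a nested cover operation C(E_1, E_2, ..., E_m) = cover(...cover(cover(E_1, E_2), ...), E_m). With respect to problem (3), C already contains the physical significance of a (the parameter that prevents subgraphs from being too similar in problem (3)). Since the objective is to maximize the total covering of the input network, the extracted subgraphs should be different, and thus we do not need to employ a in problem (4).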
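The coverage measure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming an edge counts as covered when both of its endpoints appear together in at least one extracted subgraph; the function name `coverage` and the toy graph are hypothetical and are not taken from the paper or its figures.

```python
def coverage(edges, subgraphs):
    """Fraction of edges whose two endpoints lie together in at least one subgraph."""
    edge_set = {frozenset(e) for e in edges}  # undirected, duplicates removed
    if not edge_set:
        return 0.0
    covered = set()
    for nodes in subgraphs:
        node_set = set(nodes)
        # An edge is collected by this subgraph if both endpoints are in it.
        covered |= {e for e in edge_set if e <= node_set}
    return len(covered) / len(edge_set)


# Hypothetical toy graph: a 6-node ring plus three chords (9 edges).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4), (2, 5), (3, 6)]
print(coverage(edges, [[1, 2, 3, 4]]))
print(coverage(edges, [[1, 2, 3, 4], [1, 4, 5, 6], [2, 3, 5, 6]]))
```

Because covered edges are accumulated in a set, an edge found by several subgraphs is counted once, which is why heavily overlapping subgraphs add little to C.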
The functional relationship between the three elements listed above is non-linear, as can be seen from a simulation of random sampling shown in Fig 5. We find that increasing subgraph size is more helpful in promoting representativeness than increasing the number of subgraphs, because the gradient $dC/dk$ is greater than $dC/dm$.
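For the random-sampling baseline, the non-linear dependence of C on k and m can be explored numerically. The sketch below assumes that each of the m subgraphs is an independent, uniformly random set of k nodes; under that assumption an edge is missed by one sample with probability 1 - k(k-1)/(N(N-1)), so the expected coverage is 1 - (1 - k(k-1)/(N(N-1)))^m. The function names are hypothetical, and this is not the simulation code behind Fig 5.

```python
import random


def expected_random_coverage(n, k, m):
    """Expected covered fraction for m independent uniform k-node samples."""
    p = k * (k - 1) / (n * (n - 1))  # one sample contains both endpoints of an edge
    return 1.0 - (1.0 - p) ** m


def simulated_random_coverage(edges, n, k, m, trials=500, seed=0):
    """Monte Carlo estimate of the same quantity; nodes are labelled 0..n-1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [set(rng.sample(range(n), k)) for _ in range(m)]
        hit = sum(1 for u, v in edges if any(u in s and v in s for s in samples))
        total += hit / len(edges)
    return total / trials


# Marginal effect of one more node per subgraph versus one more subgraph.
n, k, m = 100, 20, 10
base = expected_random_coverage(n, k, m)
print("gain from k+1:", expected_random_coverage(n, k + 1, m) - base)
print("gain from m+1:", expected_random_coverage(n, k, m + 1) - base)
```

Evaluating the two marginal gains over a grid of k and m values gives a quick way to check where the comparison between the two gradients holds for purely random extraction.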

Algorithm
Given a fixed number of subgraphs (m), subgraph size (k) and population size (N), the number of candidate solutions grows combinatorially: a single subgraph alone can be chosen in $\binom{N}{k}$ ways. Traversing all these solutions exhaustively is not feasible in polynomial time. Moreover, because problem (4) reduces to the NP-hard DkSP (problem (2)) when m = 1, MmCkSP is itself NP-hard. Compared with the NP-hard MmDSP proposed by Balalau et al. [23], MmCkSP is more complicated because of the higher time cost of computing the covering in place of the average density, as well as the bound k on subgraph size. In this section, we introduce a memetic algorithm, which combines a genetic algorithm with a heuristic local search, to find multiple subgraphs that cover the input network (MA-MmCkSP). The memetic operation includes both long-distance and short-distance search and has proved to be effective in solving NP-hard problems [62,63].
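To give a feel for the size of the search space, the following sketch counts candidate solutions under one simple convention: a solution is an ordered tuple of m node sets of size k, with repeated sets allowed. That convention and the function name are assumptions made for illustration; the count used in the paper may differ.

```python
from math import comb


def search_space_size(n, k, m):
    """Number of ordered m-tuples of k-node subsets drawn from n nodes."""
    return comb(n, k) ** m


# Number of decimal digits of the count for a 100-node network.
print(len(str(search_space_size(100, 30, 5))))
```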

Framework
The framework of MA-MmCkSP is shown in Algorithm 1. We first input the necessary parameters and the adjacency matrix of the input network. An initial population P is generated that consists of a list of solutions (coded as chromosomes), and then the process is repeated until the maximum number of iterations is reached or the coverage of the input network remains unchanged over 50 iterations. At each iteration, tournament selection is used to select a parent population P_parent with the highest representativeness. Next, we perform a genetic operation on P_parent to form an offspring population P_offspring. The local-search function is then applied to find the local maximum solution for the offspring population, and an updating function is used to construct a new population P with better solutions. After the iterations finish, we output the fittest solution by decoding.


Until Termination (I_max)
9. Decode(P)
10. Output: the best set of extracted subgraphs found and its covering.
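The overall loop described in this subsection can be sketched generically. The code below is not Algorithm 1 itself: the tournament size of 2, the elitist update rule, and all function and parameter names are assumptions, and the problem-specific operators (initialization, fitness, genetic operation, local search) are passed in as callables. The toy usage runs the loop on a stand-in bit-counting problem rather than on MmCkSP.

```python
import random


def memetic_search(init, fitness, vary, local_search,
                   pop_size=20, max_iter=200, stall_limit=50, seed=0):
    """Generic memetic loop; the four callables are supplied by the caller."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    best_fit, stall = fitness(best), 0
    for _ in range(max_iter):
        # Tournament selection (size 2, an assumption) builds the parent pool.
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # Genetic operation on parent pairs, then local search on each child.
        offspring = [local_search(vary(a, b, rng), rng)
                     for a, b in zip(parents[::2], parents[1::2])]
        # Update: keep the pop_size fittest of the old population plus offspring.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:pop_size]
        top_fit = fitness(population[0])
        if top_fit > best_fit:
            best, best_fit, stall = population[0], top_fit, 0
        else:
            stall += 1
            if stall >= stall_limit:  # fitness unchanged for stall_limit iterations
                break
    return best, best_fit


# Toy usage on a stand-in problem (maximise the number of 1-bits), not MmCkSP.
def toy_init(rng):
    return tuple(rng.randint(0, 1) for _ in range(10))


def toy_vary(a, b, rng):
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))


def toy_climb(ind, rng):
    bits = list(ind)
    for i in range(len(bits)):
        if bits[i] == 0:  # flipping a 0 to 1 always improves this toy fitness
            bits[i] = 1
    return tuple(bits)


best, best_fit = memetic_search(toy_init, sum, toy_vary, toy_climb)
print(best, best_fit)
```

Passing the operators as arguments keeps the loop independent of the chromosome encoding, so the same skeleton can be reused with coverage as the fitness function.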

Representation and initialization
Each solution is encoded as a chromosome that consists of m substrings, where m is the number of subgraphs. Each substring represents the node set of one subgraph and is denoted by a list of genes x ∈ {1, 2, ..., n} that specifies which nodes are included. Fig 6 illustrates the representation for a subgraph size of 5 and 4 subgraphs, so the chromosome consists of four substrings of five genes each. If we change the 5th gene from 5 to 10 in the first substring, the new solution will substitute node 10 for node 5 in the first subgraph.
For the initialization, we generate a population and randomly select the nodes for each substring in every chromosome.
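The encoding and random initialization can be sketched as follows. This is an illustrative sketch only: storing each substring as a sorted Python list and the function names are assumptions, not details given in the paper.

```python
import random


def random_chromosome(n, k, m, rng):
    """m substrings, each a sorted list of k distinct node ids from 1..n."""
    return [sorted(rng.sample(range(1, n + 1), k)) for _ in range(m)]


def init_population(n, k, m, pop_size, seed=0):
    """Population of randomly generated chromosomes."""
    rng = random.Random(seed)
    return [random_chromosome(n, k, m, rng) for _ in range(pop_size)]


def decode(chromosome):
    """Turn a chromosome into a list of node sets, one per subgraph."""
    return [set(substring) for substring in chromosome]


population = init_population(n=10, k=5, m=4, pop_size=6)
print(population[0])
```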

Genetic operation
The genetic operation includes both crossover and mutation, which are the primary operations in the genetic algorithm. The algorithm performs the crossover procedure with probability P_c, and executes the mutation procedure with probability P_m = 1−P_c. To some extent crossover represents long-term search, while mutation represents short-term search. Thus an appropriate setting of P_m = 1−P_c balances long-term and short-term search, which helps to increase the efficiency of the genetic algorithm [64,65].
In the crossover operation, two parental chromosomes are chosen using tournament selection. We first shuffle the order of the substrings in each chromosome to maintain diversity, and then find the genes that differ between the two chromosomes in each substring. For each pair of differing genes, we generate a random number γ; if γ < 0.5, the genes remain unchanged, and if γ ≥ 0.5, the corresponding genes are swapped between the two chromosomes. Finally, we add the common genes and form the two offspring chromosomes. The crossover operation is illustrated in Fig 7. After the substring order is shuffled, substring 3 in parent 1 and substring 2 in parent 2 are reassigned to the first substring. The genes that differ between parent 1' and parent 2' are grey. Since the generated random numbers are 0.3, 0.6, 0.9, and 0.4, respectively, for the first substring, we swap the second and third differing genes between the two parental chromosomes because the corresponding γ ≥ 0.5.
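The crossover step can be sketched as below. This is a hedged illustration of the description above, assuming chromosomes are lists of equal-length lists of distinct node ids and that differing genes are paired in the order in which they appear; the function name is hypothetical.

```python
import random


def crossover(parent1, parent2, rng):
    """Shuffle substrings, then swap differing genes with probability 0.5."""
    p1 = [list(s) for s in parent1]  # copies, so the parents are not modified
    p2 = [list(s) for s in parent2]
    rng.shuffle(p1)
    rng.shuffle(p2)
    child1, child2 = [], []
    for s1, s2 in zip(p1, p2):
        common = set(s1) & set(s2)
        only1 = [g for g in s1 if g not in common]  # genes present only in s1
        only2 = [g for g in s2 if g not in common]  # genes present only in s2
        for i in range(len(only1)):
            if rng.random() >= 0.5:  # gamma >= 0.5: swap this pair of genes
                only1[i], only2[i] = only2[i], only1[i]
        # Add the common genes back to form the offspring substrings.
        child1.append(sorted(common) + only1)
        child2.append(sorted(common) + only2)
    return child1, child2


rng = random.Random(1)
a = [[1, 2, 3], [4, 5, 6]]
b = [[2, 3, 7], [6, 8, 9]]
c1, c2 = crossover(a, b, rng)
print(c1, c2)
```

Because the two lists of differing genes are disjoint from each other and from the common genes, each offspring substring still contains k distinct nodes after swapping.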
In the mutation operation, we randomly select an element x_i in each substring and replace it with a randomly chosen node number that differs from x_i and from every other node number in the same substring.
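A minimal sketch of this mutation, assuming node ids run from 1 to n and each substring is a non-empty list of distinct ids (the function name is hypothetical):

```python
import random


def mutate(chromosome, n, rng):
    """Replace one randomly chosen gene per substring with an unused node id."""
    mutant = []
    for substring in chromosome:
        new = list(substring)
        unused = [v for v in range(1, n + 1) if v not in substring]
        if unused:  # nothing to do if the substring already holds every node
            new[rng.randrange(len(new))] = rng.choice(unused)
        mutant.append(new)
    return mutant


rng = random.Random(3)
original = [[1, 2, 3], [4, 5, 6]]
print(mutate(original, 10, rng))
```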

Local search
Local search is effective in reducing inefficient exploration and not only improves the accuracy but also speeds up the convergence [64-66]. Here we employ a hill-climbing technique presented as Algorithm 2. We check each element in a chromosome and replace the original gene with a node number that increases the objective function (the coverage of the input network) on substitution. The chromosome can then reach a local optimum.
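The hill-climbing idea can be sketched as follows. This is not Algorithm 2 itself: it assumes a single pass over the genes, in which each gene is replaced by the unused node that gives the largest increase in coverage (or left unchanged if none improves it). The helper `coverage` and all names are illustrative.

```python
def coverage(edges, chromosome):
    """Fraction of edges with both endpoints inside at least one substring."""
    node_sets = [set(s) for s in chromosome]
    hit = sum(1 for u, v in edges if any(u in s and v in s for s in node_sets))
    return hit / len(edges)


def hill_climb(chromosome, edges, n):
    """One pass over all genes; each gene moves to its best improving node."""
    current = [list(s) for s in chromosome]  # work on a copy
    best = coverage(edges, current)
    for substring in current:
        for pos in range(len(substring)):
            original = substring[pos]
            taken = set(substring)  # nodes already in this substring
            best_node = original
            for node in range(1, n + 1):
                if node in taken:
                    continue
                substring[pos] = node  # try the substitution in place
                score = coverage(edges, current)
                if score > best:
                    best, best_node = score, node
            substring[pos] = best_node  # keep the best substitution found
    return current, best


# Hypothetical toy graph: a 6-node ring plus three chords (9 edges).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4), (2, 5), (3, 6)]
start = [[1, 2, 3, 4], [1, 2, 3, 4]]
improved, score = hill_climb(start, edges, 6)
print(improved, score)
```

Each accepted substitution strictly increases the coverage, so the returned chromosome is never worse than the input.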


Complexity analysis
Given a network with N nodes, number of subgraphs m and subgraph size k, the time complexity of MA-MmCkSP is analyzed as follows. At each iteration, we need to execute the crossover operation S_pool/2 times (where S_pool is the size of the mating pool) and the mutation operation S_pool times at most. Since computing the covering costs O(mk), the time complexity of the genetic operation is O(mk S_pool). In the local search procedure, finding the best neighbor for each gene needs O(Nmk); because a chromosome contains mk genes, finding the locally optimal chromosome costs O(N m^2 k^2). Since O(mk S_pool) < O(N m^2 k^2), the total time complexity of the proposed algorithm is O(N m^2 k^2).

Results
In this section, we show the effectiveness and efficiency of MA-MmCkSP running on a computer-generated random network. We also carry out the procedure on various real-world networks and interpret the optimal method in the social context. The experiments were carried out in MATLAB on a Windows 10 computer with a 2.11 GHz CPU and 16.00 GB of memory. Table 1 shows the parameters used in the experiments that gave the best performance for the proposed algorithms.

Results for computer-generated networks
In order to assess the effectiveness of MA-MmCkSP, we compare it with random extraction (RE), the big-degree sampling method where big-degree nodes have a higher probability of being extracted (BD-MmCkSP), the greedy algorithm based on the operation of local search (GR-MmCkSP) and the genetic algorithm without local search (GA-MmCkSP). The five methods were carried out on an ER random network consisting of 100 nodes and 1,000 edges.
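The test network and the two sampling baselines can be sketched as below. The exact weighting used by BD-MmCkSP is not given in the text, so drawing nodes without replacement with probability proportional to degree + 1 is an assumption, as are all function names and the choice of a fixed-edge-count random graph.

```python
import random


def er_graph(n, n_edges, rng):
    """Random graph with n nodes (1..n) and exactly n_edges distinct edges."""
    pairs = [(u, v) for u in range(1, n + 1) for v in range(u + 1, n + 1)]
    return rng.sample(pairs, n_edges)


def random_extraction(n, k, m, rng):
    """RE baseline: m subgraphs of k uniformly random distinct nodes."""
    return [rng.sample(range(1, n + 1), k) for _ in range(m)]


def big_degree_extraction(edges, n, k, m, rng):
    """BD baseline: high-degree nodes are more likely to be drawn."""
    degree = {v: 0 for v in range(1, n + 1)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    subgraphs = []
    for _ in range(m):
        pool, chosen = list(degree), []
        for _ in range(k):
            weights = [degree[v] + 1 for v in pool]  # +1 keeps isolated nodes eligible
            pick = rng.choices(pool, weights=weights, k=1)[0]
            chosen.append(pick)
            pool.remove(pick)  # sampling without replacement within a subgraph
        subgraphs.append(chosen)
    return subgraphs


rng = random.Random(0)
graph = er_graph(100, 1000, rng)
re_solution = random_extraction(100, 20, 5, rng)
bd_solution = big_degree_extraction(graph, 100, 20, 5, rng)
print(len(graph), len(re_solution), len(bd_solution))
```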

The subgraphs extracted by MA-MmCkSP are the densest, those from GR-MmCkSP are the second densest, and those from GA-MmCkSP are the third densest, while the subgraphs from BD-MmCkSP and RE are the sparsest. The results suggest that the optimal extracted subgraphs are more likely to be denser. The proposed algorithm can thus provide a new alternative for solving the multiple-m densest subgraphs problem (MmDSP). In addition, we find the density of subgraphs increases as the subgraph size increases, but decreases as the number of subgraphs decreases. There is a tradeoff among the subgraph density, subgraph size and the number of subgraphs.
We also compared the results obtained using MA-MmCkSP with those using GA-MmCkSP at each iteration, and we see that the memetic operation is more efficient. Fig 10 shows the results for the two methods with different settings for subgraph size and number. MA-MmCkSP performs much better and converges faster than GA-MmCkSP. The difference is especially apparent in Fig 10D, where MA-MmCkSP is able to reach a covering of 100% at the first iteration, while GA-MmCkSP converges after the 80th iteration and even then does not reach 100%.
In order to find characteristics of the extracted nodes, we compute the correlation between the number of times each node is selected using MA-MmCkSP and the network centrality, as shown in Fig 11. Comparing Figs 8A and 11A, when the network cannot be completely collected (i.e., the subgraph size is smaller than 37), the probability of a node being selected is highly correlated with its centrality. The correlation dramatically decreases as the boundary size surpasses the critical value. Fig 11B shows a similar result if central nodes are more likely to be included repeatedly. The results suggest that including central nodes is helpful in achieving the network covering.
A sensitivity analysis for the proposed algorithm was conducted on a network with 1,000 nodes and 10,000 edges. The results show that MA-MmCkSP still performs best in maximizing the coverage of larger networks, as shown in Fig 12. We also tested the performance of MA-MmCkSP on networks with different average densities and found that, for a fixed number of subgraphs and subgraph size, the extracted subgraphs cover less of the input network as its density increases, as shown in Fig 13. A larger subgraph size is therefore required if we aim to investigate a denser social network.

Results for real-world networks
In this section, we test RE, GA-MmCkSP and MA-MmCkSP on six real-world networks, namely Zachary's Karate Club network, Bottlenose Dolphins network, American College Football network and three migrant workers' networks of ADS, YDSC and WH companies in Shenzhen, China.
Zachary's Karate Club network consists of 34 karate-club members and 78 social ties observed by Zachary over two years [67]. The Bottlenose Dolphins network was constructed by Lusseau [68], who observed 62 bottlenose dolphins and their 159 connections over seven years. The American College Football network was constructed from the schedule of Division I games during the year 2000 football season. The network consists of 115 nodes that represent teams and 616 edges that represent the regular season games between the two teams that they connect [7]. The next three examples are networks of migrant workers in ADS, YDSC, and WH companies investigated by the New Urbanization and Sustainable Development Group of Xi'an Jiaotong University [65]. The three networks were constructed from a single-item question that asked the participant to enumerate individuals with whom they are often in contact at work. The ADS network consists of 165 nodes and 1196 edges; the YDSC network consists of 70 nodes and 272 edges; the WH network consists of 193 nodes and 887 edges. The survey involved both network-level and individual-level investigations.
For each network, the number of subgraphs is m = 5, 10, or 30, and the subgraph size is chosen from k = 0.1N, k = 0.2N, k = 0.3N, k = 0.4N, k = 0.5N, where N is the network size. Table 2 shows the mean and maximum value of representativeness over 10 runs produced by RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP with different values of m and k. We find that MA-MmCkSP performs much better than the other algorithms. Moreover, subgraph size k plays a much more important role in the multiple extractions: a small increase in k can produce a large improvement in covering. Even random extraction is able to cover all the edges when k reaches 0.4N.
By decoding the best chromosomes generated by the proposed algorithm, we can extract the specific sampling solution in each subgraph. Fig 14 presents one of the best extraction methods for Zachary's Karate Club network with k = 0.3N ≈ 10 and m = 5. The present solution is able to collect all the edges, i.e., the covering C = 100%. In Zachary's Karate Club network, nodes 1, 2, 3, 33, 34 are key individuals who have the highest centrality, and we find that at least two central nodes are needed to include as many edges as possible. However, there is no solution that includes all five central nodes within the same subgraph. This is because a pair of central nodes may be disconnected, so including these nodes together may not collect any edges. For example, investigating nodes 1, 3, 34 cannot collect any edges although they have important positions in the network. This suggests that including the central nodes is important, but extracting only the central nodes may not lead to a result that gives the best coverage.
We find that the optimal method is associated with the community structure. Zachary's Karate Club network is a typical network with characteristic community structure [67]. The network can be naturally divided into two communities where edges are denser within the same community but sparser between the different communities. Fig 15 presents the optimized solution of Zachary's Karate Club network with k = 0.2N ≈ 7 and m = 5. This solution cannot collect all the edges (C = 0.769) because of the limited subgraph size. Most of the extracted nodes within the same community are included in one independent subgraph, which suggests that collecting nodes within the same community in each subgraph is helpful for collecting as many edges as possible; we call this the "community collecting method" (CCM). However, the edges between different communities can be hard to detect using CCM. Therefore, CCM is appropriate where the subgraph size or number are so limited that the optimized solution cannot collect all the edges (in other words, C < 1). Another limitation of CCM is that it may not work effectively on networks without community structure (modularity Q < 0.3) [7].
In order to test the performance of CCM, we ran MA-MmCkSP on the benchmark networks proposed by Lancichinetti et al. [69]. Each network consists of 128 nodes with an average degree of 16. These nodes are evenly assigned one of the clustering attributes {1, 2, 3, 4}. We introduce a mixing parameter that denotes the fraction of a node's edges that link to nodes with different clustering attributes. A higher mixing parameter corresponds to a smaller modularity of the input network. We generated nine networks for values of the mixing parameter ranging from 0 to 0.5. Fig 16 shows the covering result for different mixing parameters given the number of subgraphs m = 10 and the subgraph size k = 10, 20 and 30. We find that the extracted subgraphs cover less of the network as the mixing parameter increases. This is because the optimal solution is based on CCM, and CCM performs less efficiently as the community structure of the input network becomes weaker.

Conclusion and discussion
The present study provides a new perspective on addressing the multiple densest subgraph problem. We advance research on this topic by formulating the problem of covering the input network as an optimization problem and propose a model that maximizes the covering of the observed network by extracting multiple subgraphs. A memetic algorithm that combines a genetic algorithm with local search optimizes the extraction in each independent subgraph. The proposed algorithm can solve the optimization problem effectively. Compared with increasing the number of extractions, increasing subgraph size is more helpful in improving the coverage of the network. Including nodes with higher centrality is necessary, but investigating only those nodes cannot fully reproduce the input network structure because the common edges connected with normal crowds (nodes with lower centrality) can easily be ignored. When subgraph size or numbers are constrained, the community collecting method, which includes nodes within the same community in each subgraph, can be an effective way of enhancing the covering. A suggestion for practitioners is to recognize the potential community structure of research objects before conducting the extractions.
From a sociological view, previous research has highlighted the effectiveness of random sampling [27,45,57], but this method is not effective when surveys are conducted repeatedly. This is because random sampling in multiple surveys leads to redundancy, where an edge may be detected many times. The top-down sampling method (choosing the top nodes ordered by size) is also of limited value in repeated surveys, because edges connected by nodes with different rank sizes cannot be collected. Including central nodes helps to enhance the covering, but including only the representative nodes may not lead to a representative result. On the other hand, node size is difficult to estimate precisely in social networks. Before acquiring the whole structure of a network, it is difficult to judge whether an individual is a central or marginal member. An illustration is presented in Fig 17, which shows the difference between different methods in recognizing core nodes. The network in Fig 17 is the ADS migrant workers' network. By asking "how many friends or acquaintances do you have in Shenzhen (ADS is located in this city)?" in the individual-level questionnaire, we can divide the company members into "big-size" individuals, who have 30 or more friends or acquaintances, and "small-size" individuals, who do not have as many as 30 friends (see Fig 17A). By applying the core-periphery model [5] to the whole network, we can also find big-size individuals and small-size individuals, as shown in Fig 17B. This figure is derived using Eq (5), where α_ij is the relationship between nodes i and j, c_i is one of node i's attributes (core or periphery), and "•" indicates a missing value, which treats the off-diagonal regions of α_ij as missing data and helps maximize density in the core and minimize density in the periphery. The inconsistency between Fig 17A and Fig 17B suggests that top-down sampling may choose some fake big-size nodes, which undermines the accuracy of network estimation.
Maximize $\rho = \sum_{i,j} \alpha_{ij}\,\delta_{ij}$, where $\delta_{ij} = 1$ if $c_i$ and $c_j$ are both core, $\delta_{ij} = 0$ if $c_i$ and $c_j$ are both periphery, and $\delta_{ij}$ = • otherwise. (5)
In most natural settings, practitioners have no idea as to the real structure of the actual network. In order to collect all the potential edges, practitioners should assume that the actual network is completely connected, so that each pair of nodes is connected. The proposed algorithm can also be applied to the covering for a completely connected network.
Despite the merits of these new proposals, there are some limitations to the present study. The algorithm may sometimes be trapped in a local maximum, and we plan to design a more intelligent algorithm in the future. The objective function (coverage of the input network) in this paper is the detected number of edges divided by the total number of edges, while other indices, such as centrality, might also be employed in the optimization model. A meaningful analysis of social networks requires both individual-level and network-level investigations, and thus an index for measuring the covering of multiple subgraphs that considers both individual and relational attributes needs to be designed in the future.